1
00:00:00,790 --> 00:00:07,320
[Music]

2
00:00:12,510 --> 00:00:09,350
[Applause]

3
00:00:15,450 --> 00:00:12,520
and so I'm gonna be talking about random

4
00:00:18,060 --> 00:00:15,460
polypeptide sequences and their

5
00:00:22,200 --> 00:00:18,070
relevance for the origins of life and

6
00:00:25,589 --> 00:00:22,210
I'm just gonna start straight there

7
00:00:27,600 --> 00:00:25,599
without much introduction so I would

8
00:00:30,330 --> 00:00:27,610
like to ask you to imagine first that

9
00:00:32,940 --> 00:00:30,340
the circle here includes all the

10
00:00:37,979 --> 00:00:32,950
biological sequences that our life uses

11
00:00:42,150 --> 00:00:37,989
and we know that these sequences occupy

12
00:00:44,910 --> 00:00:42,160
only a very tiny part of the possible

13
00:00:51,900 --> 00:00:44,920
sequence space and of course this is

14
00:00:54,049 --> 00:00:51,910
terribly out of proportion but

15
00:00:56,790 --> 00:00:54,059
this is basically just to illustrate

16
00:00:58,860 --> 00:00:56,800
what we are interested in so we are

17
00:01:03,840 --> 00:00:58,870
interested in what actually lies

18
00:01:07,260 --> 00:01:03,850
outside this box or this circle in this

19
00:01:09,990 --> 00:01:07,270
case in the random sequence space and

20
00:01:12,330 --> 00:01:10,000
especially when it comes to what could

21
00:01:17,819 --> 00:01:12,340
have kind of been lying around during the

22
00:01:20,370 --> 00:01:17,829
early origins so to start our

23
00:01:23,010 --> 00:01:20,380
exploration we basically looked at

24
00:01:25,620 --> 00:01:23,020
we performed a kind of sparse

25
00:01:27,719 --> 00:01:25,630
experimental sampling looking around the

26
00:01:32,100 --> 00:01:27,729
biological space at some specific

27
00:01:36,120 --> 00:01:32,110
sequences that are out there and to

28
00:01:38,459 --> 00:01:36,130
start we generated 10,000 random

29
00:01:41,760 --> 00:01:38,469
sequences they were all of the same

30
00:01:45,240 --> 00:01:41,770
length 100 amino acids and the natural

31
00:01:48,899 --> 00:01:45,250
amino acid distribution and we performed

32
00:01:51,620 --> 00:01:48,909
secondary structure prediction on all of

33
00:01:54,810 --> 00:01:51,630
these using multiple predictors and

34
00:01:58,380 --> 00:01:54,820
based on these predictions we sorted the

35
00:02:02,719 --> 00:01:58,390
data out and selected some sequences

36
00:02:07,260 --> 00:02:02,729
that we looked at experimentally and

37
00:02:09,749 --> 00:02:07,270
these results show the bioinformatic

38
00:02:12,090 --> 00:02:09,759
prediction of the secondary structure

39
00:02:15,720 --> 00:02:12,100
and the first panel is our random

40
00:02:17,970 --> 00:02:15,730
sequence data set and we used five

41
00:02:20,130 --> 00:02:17,980
different predictors and looked always

42
00:02:21,300 --> 00:02:20,140
at the alpha helical and beta sheet

43
00:02:23,309 --> 00:02:21,310
content

44
00:02:25,500 --> 00:02:23,319
and we did the same thing for some

45
00:02:29,040 --> 00:02:25,510
control groups the first were fragments

46
00:02:31,860 --> 00:02:29,050
from the PDB database then

47
00:02:35,570 --> 00:02:31,870
fragments from the UniProt database and

48
00:02:38,640 --> 00:02:35,580
also fragments from the database of

49
00:02:41,550 --> 00:02:38,650
disordered proteins and a short message

50
00:02:44,270 --> 00:02:41,560
here is that secondary structure

51
00:02:49,800 --> 00:02:44,280
actually seems to be quite abundant in

52
00:02:52,440 --> 00:02:49,810
random sequence space this plot here

53
00:02:54,930 --> 00:02:52,450
shows the same random data set in a

54
00:03:00,809 --> 00:02:54,940
slightly different way so on the y-axis

55
00:03:03,630 --> 00:03:00,819
here we have secondary structure content

56
00:03:07,350 --> 00:03:03,640
and on the x-axis we have the predicted

57
00:03:10,650 --> 00:03:07,360
disorder of these sequences so basically

58
00:03:14,370 --> 00:03:10,660
we see a whole range of sequences in our

59
00:03:16,259 --> 00:03:14,380
random data set and we have some

60
00:03:17,970 --> 00:03:16,269
sequences with high secondary structure

61
00:03:21,210 --> 00:03:17,980
content and some with low secondary

62
00:03:23,280 --> 00:03:21,220
structure content and we selected 45

63
00:03:27,930 --> 00:03:23,290
sequences here those are the ones in

64
00:03:31,020 --> 00:03:27,940
color for some experiments in short this

65
00:03:34,080 --> 00:03:31,030
plot here summarizes all these

66
00:03:36,120 --> 00:03:34,090
experiments so we looked at expression

67
00:03:40,289 --> 00:03:36,130
and solubility in E. coli

68
00:03:42,750 --> 00:03:40,299
and also secondary structure using CD on

69
00:03:44,309 --> 00:03:42,760
purified proteins and I don't really

70
00:03:49,410 --> 00:03:44,319
want to go into much detail there is a

71
00:03:51,599 --> 00:03:49,420
lot of detail there but to summarize we

72
00:03:54,000 --> 00:03:51,609
basically confirmed that secondary

73
00:03:56,370 --> 00:03:54,010
structure is quite abundant in random

74
00:03:59,400 --> 00:03:56,380
sequences and we did not really expect

75
00:04:01,710 --> 00:03:59,410
to see that and also the random

76
00:04:04,979 --> 00:04:01,720
sequences that actually have high

77
00:04:07,140 --> 00:04:04,989
disorder content seem to be

78
00:04:11,099 --> 00:04:07,150
better tolerated by living cells they

79
00:04:12,900 --> 00:04:11,109
have low aggregation propensity and we

80
00:04:15,539 --> 00:04:12,910
hypothesized that these could actually

81
00:04:21,330 --> 00:04:15,549
make them better progenitors of soluble

82
00:04:25,710 --> 00:04:21,340
and functional proteins so basically

83
00:04:28,360 --> 00:04:25,720
based on this little search we concluded

84
00:04:34,000 --> 00:04:28,370
that it's worth looking out there

85
00:04:36,030 --> 00:04:34,010
and and particularly we decided we're

86
00:04:40,300 --> 00:04:36,040
gonna do this in a more systematic way

87
00:04:43,780 --> 00:04:40,310
we are interested in what we could find

88
00:04:46,330 --> 00:04:43,790
from different alphabets and especially

89
00:04:49,689 --> 00:04:46,340
those amino acid alphabets that could be

90
00:04:52,540 --> 00:04:49,699
more relevant during early origins during

91
00:04:55,900 --> 00:04:52,550
evolution of proteins so basically we

92
00:04:58,870 --> 00:04:55,910
wanted to explore different subsets

93
00:05:05,230 --> 00:04:58,880
of the protein sequence

94
00:05:09,040 --> 00:05:05,240
space so our libraries are designed on

95
00:05:10,840 --> 00:05:09,050
a DNA level using degenerate codons and

96
00:05:14,830 --> 00:05:10,850
we started collaborating here with

97
00:05:17,379 --> 00:05:14,840
Kosuke Fujishima from ELSI and the

98
00:05:19,510 --> 00:05:17,389
degenerate codons basically control the

99
00:05:22,810 --> 00:05:19,520
amino acid composition of our libraries

100
00:05:27,909 --> 00:05:22,820
so we can design libraries from

101
00:05:31,529 --> 00:05:27,919
different subsets of amino acids and to

102
00:05:34,750 --> 00:05:31,539
do this as precisely as possible we

103
00:05:36,820 --> 00:05:34,760
developed an algorithm for this called

104
00:05:40,719 --> 00:05:36,830
the coot the degenerate codon

105
00:05:43,770 --> 00:05:40,729
optimization tool and this is roughly

106
00:05:49,710 --> 00:05:43,780
what it looks like so what we can do is

107
00:05:56,400 --> 00:05:49,720
basically find solutions to almost any

108
00:05:59,200 --> 00:05:56,410
alphabet and library lengths we can also

109
00:06:02,250 --> 00:05:59,210
optimize the codon usage for

110
00:06:05,260 --> 00:06:02,260
expression in different organisms and

111
00:06:07,540 --> 00:06:05,270
also remove some codons for reassignment

112
00:06:10,300 --> 00:06:07,550
to be able to bring in some of the

113
00:06:13,180 --> 00:06:10,310
unnatural amino acids so amino acids

114
00:06:16,330 --> 00:06:13,190
that could be prebiotically very relevant

115
00:06:22,839 --> 00:06:16,340
but are not part of our genetic coding

116
00:06:26,650 --> 00:06:22,849
system at the moment so using this using

117
00:06:30,250 --> 00:06:26,660
this tool we basically started making

118
00:06:33,190 --> 00:06:30,260
some of these libraries and I guess one

119
00:06:35,920 --> 00:06:33,200
of the newest things here is that we have

120
00:06:37,630 --> 00:06:35,930
been able to express enough quantity of

121
00:06:41,950 --> 00:06:37,640
these libraries using a cell-free

122
00:06:44,230 --> 00:06:41,960
expression system and purify them this is a

123
00:06:47,140 --> 00:06:44,240
sample of one of these libraries and I

124
00:06:50,290 --> 00:06:47,150
know this looks very ugly but the reason

125
00:06:52,450 --> 00:06:50,300
is that we have a whole distribution of

126
00:06:54,790 --> 00:06:52,460
molecular weights within this within

127
00:06:57,070 --> 00:06:54,800
this band so that's basically what you

128
00:07:01,330 --> 00:06:57,080
see here on the MALDI spectrum is what

129
00:07:03,730 --> 00:07:01,340
we have here in the sample and of course

130
00:07:07,089 --> 00:07:03,740
we are interested in libraries that

131
00:07:07,870 --> 00:07:07,099
would best mimic what was around kind of

132
00:07:11,620 --> 00:07:07,880
lying around

133
00:07:14,140 --> 00:07:11,630
during the early origins and while we do

134
00:07:17,050 --> 00:07:14,150
have some ideas what libraries we would

135
00:07:19,810 --> 00:07:17,060
like to make here I also wanted to use

136
00:07:21,580 --> 00:07:19,820
the chance of actually being here being

137
00:07:24,550 --> 00:07:21,590
able to present and thank you to the

138
00:07:28,810 --> 00:07:24,560
organizers for that and kind of invite

139
00:07:31,450 --> 00:07:28,820
you to now give us some some feedback on

140
00:07:33,399 --> 00:07:31,460
this because we are quite often asked like

141
00:07:36,129 --> 00:07:33,409
why these amino acids and why not the

142
00:07:39,370 --> 00:07:36,139
others and why this long and not that

143
00:07:41,469 --> 00:07:39,380
long so I would like to really invite

144
00:07:46,270 --> 00:07:41,479
you to come and talk to us at some point

145
00:07:50,080 --> 00:07:46,280
and tell us if you have ideas about what

146
00:07:53,320 --> 00:07:50,090
is more relevant and before I do that

147
00:07:56,620 --> 00:07:53,330
I'd like to acknowledge especially my

148
00:07:59,230 --> 00:07:56,630
co-workers Kosuke Fujishima from ELSI and

149
00:08:01,719 --> 00:07:59,240
also Stephen Fried who is newly working

150
00:08:03,189 --> 00:08:01,729
with us on this project from Johns

151
00:08:06,399 --> 00:08:03,199
Hopkins University

152
00:08:09,939 --> 00:08:06,409
I would also acknowledge the people that

153
00:08:12,219 --> 00:08:09,949
I work with especially these guys from

154
00:08:15,219 --> 00:08:12,229
my group at the Charles University and

155
00:08:18,490 --> 00:08:15,229
that's in Prague and it's an ancient

156
00:08:21,939 --> 00:08:18,500
University but we work at this new

157
00:08:22,839 --> 00:08:21,949
campus close to Prague so thank you for

158
00:08:31,459 --> 00:08:22,849
your attention

159
00:09:09,319 --> 00:08:34,050
Thank you Clara we have time

160
00:09:09,329 --> 00:09:26,800
yeah they were all 100 amino acids long

161
00:09:26,810 --> 00:09:36,309
[Music]

162
00:09:43,639 --> 00:09:40,729
yeah so the first study that I talked

163
00:09:46,129 --> 00:09:43,649
about was basically just looking at what

164
00:09:49,009 --> 00:09:46,139
we can find in the random sequence space

165
00:09:52,819 --> 00:09:49,019
but with not that much relevance to origins

166
00:09:55,309 --> 00:09:52,829
of life so we chose that length of protein

167
00:09:58,699 --> 00:09:55,319
basically just because it's easier to

168
00:10:01,909 --> 00:09:58,709
work with and it was for us the first

169
00:10:03,590 --> 00:10:01,919
kind of little search so the proteins

170
00:10:07,909 --> 00:10:03,600
that we want to be looking at now would

171
00:10:10,879 --> 00:10:07,919
be much shorter yeah ranging from

172
00:10:20,100 --> 00:10:10,889
something between 20 and 60 is what we

173
00:10:20,110 --> 00:10:32,160
yeah

174
00:10:32,170 --> 00:10:35,810
hmm

175
00:10:35,820 --> 00:11:00,540
this sort of observe

176
00:11:05,199 --> 00:11:03,189
yes thank you this is something we have

177
00:11:08,019 --> 00:11:05,209
been thinking about a lot and talking to

178
00:11:10,269 --> 00:11:08,029
the disordered kind of protein community

179
00:11:12,129 --> 00:11:10,279
about it and I do agree that the

180
00:11:15,250 --> 00:11:12,139
disorder that we see is different than

181
00:11:18,879 --> 00:11:15,260
the disorder that they see first thing

182
00:11:21,910 --> 00:11:18,889
there is a very strong compositional

183
00:11:25,629 --> 00:11:21,920
bias in today's IDP proteins that are

184
00:11:28,389 --> 00:11:25,639
mostly eukaryotic and yeah of course

185
00:11:42,040 --> 00:11:28,399
it's also usually very functional what we

186
00:11:46,329 --> 00:11:42,050
see is a different case of disorder hi

187
00:11:51,069 --> 00:11:46,339
Mottola yes so we have so far been

188
00:11:54,069 --> 00:11:51,079
looking at comparing the full alphabet

189
00:11:56,470 --> 00:11:54,079
with different versions of what's

190
00:12:00,370 --> 00:11:56,480
considered the early alphabet meaning

191
00:12:04,480 --> 00:12:00,380
mostly just those amino acids that are

192
00:12:08,800 --> 00:12:04,490
biologically coded today and were

193
00:12:10,870 --> 00:12:08,810
considered to be prebiotically available so

194
00:12:15,819 --> 00:12:10,880
that's what we have been doing so far

195
00:12:18,370 --> 00:12:15,829
but it's mostly work in progress right

196
00:12:21,069 --> 00:12:18,380
so I had a question um so for your

197
00:12:23,769 --> 00:12:21,079
newer libraries do you have a way of

198
00:12:24,790 --> 00:12:23,779
dealing with sort of single

199
00:12:26,530 --> 00:12:24,800
nucleotide insertions or deletions

200
00:12:28,300 --> 00:12:26,540
because I could imagine that if

201
00:12:29,949 --> 00:12:28,310
your library say when it's transcribed

202
00:12:31,840 --> 00:12:29,959
and your cell-free translation system

203
00:12:34,720 --> 00:12:31,850
picks up a single insertion or deletion

204
00:12:35,829 --> 00:12:34,730
that throws off your whole codon design is

205
00:12:37,660 --> 00:12:35,839
there something about the system that

206
00:12:41,530 --> 00:12:37,670
would prevent those from

207
00:12:43,620 --> 00:12:41,540
coming up yes that's a good question it

208
00:12:47,650 --> 00:12:43,630
doesn't seem to be happening too much

209
00:12:51,069 --> 00:12:47,660
within our new library so we

210
00:12:53,879 --> 00:12:51,079
didn't use a lot of PCR not too much

211
00:12:57,759 --> 00:12:53,889
anyway and tried to use high-fidelity

212
00:13:01,300 --> 00:12:57,769
enzymes on the way so

213
00:13:03,220 --> 00:13:01,310
but after sequencing our library on the